Abstract

We introduce a technique to rapidly generate summed-area tables using graphics hardware. Summed-area
tables, originally introduced by Crow, provide a way to filter arbitrarily large rectangular regions of an image in
a constant amount of time. Our algorithm for generating summed-area tables, similar to a technique used in
scientific computing called recursive doubling, allows the generation of a summed-area table in O(log n) time. We
also describe a technique to mitigate the precision requirements of summed-area tables. The ability to calculate
and use summed-area tables at interactive rates enables numerous interesting rendering effects. We present
several possible applications. First, the use of summed-area tables allows real-time rendering of interactive, glossy
environmental reflections. Second, we present glossy planar reflections with varying blurriness, dependent on a
reflected object's distance to the reflector. Third, we show a technique that uses a summed-area table to render
glossy transparent objects. The final application demonstrates an interactive depth-of-field effect using summed- 
area tables. 

Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Three-Dimensional
Graphics and Realism: Color, shading, shadowing, and texture; I.4.3 [Image Processing and Computer Vision]:
Enhancement: Filtering 



1. Introduction 

There are many applications in computer graphics where 
spatially varying filters are useful. One example is the ren- 
dering of glossy reflections. Unlike perfectly reflective ma- 
terials, which only require a single radiance sample in the 
direction of the reflection vector, glossy materials require 
integration over a solid angle. Blurring by filtering the re- 
flected image with a support dependent on the surface's 
BRDF can approximate this effect. This is currently done 
by pre-filtering off line, which limits the technique to static 

Crow [Cro84] introduced summed-area tables to enable 
more general texture filtering than was possible with mip 
maps. Once generated, a summed-area table provides a 
means to evaluate a spatially varying box filter in a constant 
number of texture reads. Heckbert [Hec86] extended Crow's 
work to handle complex filter functions. 

In this paper we present a method to rapidly generate
summed-area tables that is efficient enough to allow multiple
tables to be generated every frame while maintaining
interactive frame rates. We demonstrate the applicability of
spatially varying filters for real-time, interactive computer
graphics through four different applications.

Figure 1: An image illustrating the use of a summed-area
table to render glossy planar reflections where the blurriness
of an object varies depending on its distance from the

The paper is organized as follows: Section 2 provides
background information on summed-area tables. Section 3
presents our technique for generating summed-area tables.
Section 4 describes a method for alleviating the precision
requirements of summed-area tables. Section 5 presents
several example applications using summed-area tables for
real-time graphics, followed by a discussion of summed-area
table generation performance in Section 6. Future work is
presented in Section 7, and Section 8 concludes the paper.




2. Background 

Originally introduced by Crow [Cro84] as an alternative to 
mip maps, a summed-area table is an array in which each 
entry holds the sum of the pixel values between the sample 
location and the bottom left corner of the corresponding in- 
put image. 

Summed-area tables enable the rapid calculation of the 
sum of the pixel values in an arbitrarily sized, axis-aligned 
rectangle at a fixed computational cost. Figure 2 illustrates 
how a summed-area table is used to compute the sum of the 
values of pixels spanning a rectangular region. To find the
integral of the values in the dark rectangle, we begin with
the pre-computed integral from (0, 0) to $(x_R, y_T)$. We subtract
the integrals of the rectangles (0, 0) to $(x_R, y_B)$ and (0, 0) to
$(x_L, y_T)$. The integral of the hatched box, (0, 0) to $(x_L, y_B)$,
is then added to compensate for having been subtracted twice.
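
The four-read lookup described above can be sketched in a few lines of Python with NumPy. This is only an illustrative CPU sketch, not the paper's GPU implementation; the helper names `build_sat`, `lookup`, and `rect_sum` are hypothetical, and row 0 of the array is treated as the bottom row of the image.

```python
import numpy as np

def build_sat(img):
    """Inclusive summed-area table: sat[y, x] = sum of img[0..y, 0..x]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def lookup(sat, x, y):
    """Table entry at (x, y); positions left of or below the image contribute 0."""
    return sat[y, x] if x >= 0 and y >= 0 else 0

def rect_sum(sat, x_l, y_b, x_r, y_t):
    """Sum of the pixels with x_l < x <= x_r and y_b < y <= y_t (four reads)."""
    return (lookup(sat, x_r, y_t) - lookup(sat, x_r, y_b)
            - lookup(sat, x_l, y_t) + lookup(sat, x_l, y_b))

img = np.arange(16, dtype=np.int64).reshape(4, 4)
sat = build_sat(img)
total = rect_sum(sat, 0, 0, 2, 3)   # columns 1..2, rows 1..3
```

The cost of `rect_sum` is the same four reads regardless of the rectangle's size, which is the property the rest of the paper relies on.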

The average value of a group of pixels can be calculated 
by dividing the sum by the area. Crow's technique amounts 
to convolution of an input image with a box filter. The power 
lies in the fact that the filter support can be varied at a per 
pixel level without increasing the cost of the computation. 
Unfortunately, since the value of the sums (and thus the dy- 
namic range) can get quite large, the table entries require 
extended precision. The number of bits of precision needed 
per component is 

$P_s = \log_2(w) + \log_2(h) + P_i$

where w and h are the width and height of the input image,
$P_s$ is the precision required to hold values in the summed-area
table, and $P_i$ is the number of bits of precision of the
input. Thus, a 256x256 texture with 8-bit components would
require a summed-area table with 24 bits of storage per component.
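
As a worked instance of the formula above (writing $P_s$ for the table precision and $P_i$ for the input precision), the 256x256, 8-bit case evaluates as follows:

```latex
P_s = \log_2(256) + \log_2(256) + P_i = 8 + 8 + 8 = 24 \text{ bits}
```

By the same formula, doubling both the width and the height of the input adds two bits, so a 512x512 input with 8-bit components would need 26 bits per component.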

Another limitation of Crow's summed-area table tech- 
nique is that it is only capable of implementing a simple 
box filter. This is because only the sum of the input pixels is 
stored; therefore it is not possible to directly apply a generic 
filter by weighting the inputs. 



Figure 2: (after [Cro84]) An entry in the summed-area table
holds the sum of the values from the lower left corner of
the image to the current location. To compute the sum of the
dark rectangular region, evaluate
$T[x_R, y_T] - T[x_R, y_B] - T[x_L, y_T] + T[x_L, y_B]$,
where $T[x, y]$ is the value of the entry at (x, y).

In [Hec86], Heckbert extended the theory behind
summed-area tables to handle more complex filter functions.
Heckbert made two key observations. The first is that
a summed-area table can be viewed as the integral of the
input image, and the second is that the sample function
introduced by Crow is the same as the derivative of the box
filter function. By taking advantage of those observations and
the following convolution identity

$f \otimes g = f^{(n)} \otimes \int^{n} g$

it is possible to extend summed-area tables to handle higher 
order filter functions, such as the Bartlett filter, or even a 
Catmull-Rom spline filter. The process is essentially one of 
repeated box filtering. Higher order filters approach a Gaus- 
sian, and exhibit fewer artifacts. 

For instance, Bartlett filtering requires taking the second-order
box filter, and weighting it with the following coefficients:



Unfortunately, a direct implementation of the Bartlett filtering
example requires 44 bits of precision per component, assuming
8 bits per component and a 256x256 input image.

In general, the precision requirements of Heckbert's 
method can be determined as follows: 

$P_s = n \cdot (\log_2(w) + \log_2(h)) + P_i$

where w and h are the width and height of the input texture,
n is the degree of the filter function, $P_i$ is the input image's
precision, and $P_s$ is the required precision of the nth-degree
summed-area table.

Figure 3: The recursive doubling algorithm in 1D. On the
first pass, the value one element to the left is added to the
current value. On the second pass, the value two elements
to the left is added to the current value. In general, the stride
is doubled for each pass. The output is an array whose elements
are the sum of all of the elements to the left, computed
in O(log n) time.

A technique introduced by [AG02, YP03] combines mul- 
tiple samples from different levels of a mip map to approx- 
imate filtering. This technique suffers from several prob- 
lems. First, a small step in the neighborhood around a pixel 
does not necessarily introduce new data to the filter; it only 
changes the weights of the input values. Second, when the 
inputs do change, a large amount of data changes at the same 
time, due to the mip map, which causes noticeable artifacts. 
In [Dem04], the authors added noise in an attempt to make 
the artifacts less noticeable; the visual quality of the result- 
ing images was noticeably reduced. 

3. Summed-Area Table Generation 

In order to efficiently construct summed-area tables, we borrow
a technique, called recursive doubling [DR77], often
used in high-performance and parallel computing. Using
recursive doubling, a parallel gather operation amongst n
processors can be performed in only $\log_2(n)$ steps, where a single
step consists of each processor passing its accumulated
result to another processor.

In a similar manner, our method uses the GPU to accumulate
results so that only O(log n) passes are needed for
summed-area table construction. To simplify the following
description, we assume that only two texels can be read per
pass. Later in the discussion we explain how to generalize
the technique to an arbitrary number of texture reads per pass.
Our algorithm proceeds in two phases: first a horizontal
phase, then a vertical phase. During the horizontal phase,
results are accumulated along scan lines, and during the
vertical phase, results are accumulated along columns of
pixels. The horizontal phase consists of n passes, where
n = ceil(log2(image width)), and the vertical phase consists
of m passes, where m = ceil(log2(image height)).

For each pass we render a screen-aligned quad that covers
all pixels that do not yet hold their final sum and execute a
fragment program on every covered pixel. The input image is
stored in a texture named $t_A$. In the first pass of the horizontal
phase we read two texels from $t_A$: the one corresponding
to the pixel currently being computed and the one to the
immediate left. Both are added together and stored into texture
$t_B$.

For the second pass, we swap our textures so that we are
reading from $t_B$ and writing to $t_A$. Now the fragment program
adds the texels corresponding to the one currently being
computed and the one two pixels to the left. $t_A$ now holds
the sum of four pixels.

The third pass repeats this scheme, now reading from $t_A$
and writing to $t_B$ and summing two texels four pixels apart,
resulting in the sum of eight pixels in $t_B$. This progression
continues for the rest of the horizontal passes until all pixels
are summed up in the horizontal direction. Note that in pass i
(counting from zero) the leftmost $2^i$ pixels already hold their
final sum for the horizontal phase and thus are not covered by
the quad rendered in this pass. Next the vertical phase proceeds
in an analogous manner. Figure 3 shows the horizontal passes
needed to construct a summed-area table of a 4x4 image. The
following pseudo-code summarizes the algorithm.

    t_A = InputImage
    n = ceil(log2(width))
    m = ceil(log2(height))

    // horizontal phase: add the texel 2^i positions to the left
    for (i = 0; i < n; i = i + 1)
        t_B[x, y] = t_A[x, y] + t_A[x - 2^i, y]
        swap(t_A, t_B)

    // vertical phase: add the texel 2^i positions below
    for (i = 0; i < m; i = i + 1)
        t_B[x, y] = t_A[x, y] + t_A[x, y - 2^i]
        swap(t_A, t_B)

    // Texture t_A holds the result
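
The two-read scheme can be checked on the CPU with a short NumPy sketch. This is an illustration under stated assumptions rather than the fragment-program implementation: whole-array additions stand in for rendering passes, the helper `shift` (a hypothetical name) emulates a texel fetch with a zero border color, and plain reassignment stands in for the texture swap.

```python
import math
import numpy as np

def shift(a, k, axis):
    """Return a shifted by k texels along axis, filling with 0 (border color 0)."""
    out = np.zeros_like(a)
    if k < a.shape[axis]:
        src = [slice(None)] * a.ndim
        dst = [slice(None)] * a.ndim
        src[axis] = slice(0, a.shape[axis] - k)
        dst[axis] = slice(k, None)
        out[tuple(dst)] = a[tuple(src)]   # out[x] = a[x - k]
    return out

def sat_recursive_doubling(image):
    """Build a summed-area table with ceil(log2) passes per direction."""
    t_a = image.astype(np.float64)
    height, width = t_a.shape
    for axis, size in ((1, width), (0, height)):   # horizontal, then vertical
        for i in range(math.ceil(math.log2(size))):
            t_b = t_a + shift(t_a, 2 ** i, axis)   # one pass, two reads per texel
            t_a = t_b                              # stands in for swap(t_A, t_B)
    return t_a

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(5, 7))
table = sat_recursive_doubling(image)
```

For any input, `table` should match the directly accumulated `image.cumsum(0).cumsum(1)`, including for sizes that are not powers of two.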

In practice, reading more than two texels per fragment,
per pass is possible, which reduces the number of passes
required to generate a summed-area table. Our current
implementation supports reading 2, 4, 8, or 16 texels per
fragment, per pass. This allows trading per-pass complexity against
the number of rendering passes required. Adding 16 texels
per pass enables us to generate a summed-area table from
a 256x256 image in only four passes, two for the horizontal
phase, and two for the vertical phase. As shown in Section 6,
adjusting the per-pass complexity helps in optimizing
summed-area table generation speed for different input
texture sizes. The following is the pseudo-code to generate a
summed-area table when r reads per fragment are possible.

    t_A = InputImage
    n = ceil(log_r(width))
    m = ceil(log_r(height))

    // horizontal phase: r reads per texel, spaced r^i apart
    for (i = 0; i < n; i = i + 1)
        t_B[x, y] = t_A[x, y] +
                    t_A[x - 1*r^i, y] +
                    t_A[x - 2*r^i, y] +
                    ... +
                    t_A[x - (r-1)*r^i, y]
        swap(t_A, t_B)

    // vertical phase similar to horizontal phase
    // Texture t_A holds the result
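
A CPU sketch of the r-read variant follows, again in Python with NumPy and again only as an illustration. It assumes the reads are taken at offsets 0, r^i, 2r^i, ..., (r-1)r^i to the left, which gives exactly r reads per texel per pass; the function names `num_passes`, `accumulate_rows`, and `sat_r_reads` are hypothetical.

```python
import numpy as np

def num_passes(size, r):
    """Smallest n with r**n >= size (integer arithmetic avoids log rounding)."""
    n, span = 0, 1
    while span < size:
        span *= r
        n += 1
    return n

def accumulate_rows(t_a, r):
    """Horizontal phase: after pass i every texel holds the sum of r**(i+1) texels."""
    width = t_a.shape[1]
    for i in range(num_passes(width, r)):
        t_b = t_a.copy()                      # read at offset 0
        for k in range(1, r):                 # remaining r - 1 reads
            offset = k * r ** i
            if offset < width:
                # Reads that fall outside the image add 0 (border color 0).
                t_b[:, offset:] += t_a[:, :width - offset]
        t_a = t_b                             # stands in for swap(t_A, t_B)
    return t_a

def sat_r_reads(image, r=4):
    rows_done = accumulate_rows(image.astype(np.float64), r)
    # The vertical phase is the horizontal phase applied to the transposed table.
    return accumulate_rows(rows_done.T, r).T

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(6, 10))
tables = {r: sat_r_reads(image, r) for r in (2, 4, 16)}
```

With r = 16, `num_passes(256, 16)` is 2, so a 256x256 input takes two passes per phase, in line with the four-pass figure quoted above.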

Note that near the left and bottom image borders the frag- 
ment program will fetch texels outside the image regions. 
To ensure correct summation of the image pixels, the texture 
units must be configured to use clamp to border color mode 
with the border color set to 0. This way texel fetches outside 
the image boundaries will not affect the sum. Alternatively, 
it is possible to render a black border around the input image 
and configure the texture units to use clamp to edge mode. 

We have implemented our algorithm in both Direct3D and 
OpenGL, with similar results. In the OpenGL implementa- 
tion we used a double buffered pbuffer to mitigate the cost 
of context switches. Instead of switching context between 
each pass, we simply swap the front and back buffers of the 
pbuffer. This allows us to efficiently ping-pong between two 
textures as results are accumulated. If implemented at the 
driver level, similar to the way that automatic mip-map gen- 
eration is done, the costs of the passes could be mitigated 



4. Improving Computational Precision 

A key challenge to the usefulness of the summed-area table 
approach is the loss of numerical precision, which can lead 
to significant noise in the resultant image. This section first 
discusses the source of such precision loss and then presents 
our approach to mitigating this problem. Example images 
are provided that demonstrate how our approach achieves 
significant reduction in noise: up to 31 dB improvement in 

4.1. Source of Precision Loss 

One possible source of precision loss is the GPU's
floating-point implementation, since current graphics hardware
does not implement IEEE Standard 754 floating point.
However, as shown by Hillesland [Hil05], current GPU
implementations behave reasonably well.

The summed-area table approach can exhibit significant
noise because certain steps in the algorithm involve
computing the difference between two relatively large
finite-precision numbers with very close values. This is especially
true for pixels in the upper right portion of the image
because the monotonically increasing nature of the
summed-area function implies that the table values for that region are
all quite high.

As an example, consider the images of Fig. 4, which are
256x256 images with 8-bit components. The middle and
right columns show the image after being filtered through an
"identity filter," i.e., a 1x1 filter kernel that is ideally
supposed to produce a resultant image that is a replica of the
original image. To avoid loss of computational precision, a
summed-area table with 24 bits of storage per component
per pixel would be sufficient, since the maximum
summed-area value at any pixel cannot exceed 256x256x256.
However, the summed-area tables used in this example used only
16- and 24-bit floating-point values. As a result, significant noise
is seen in the filtered image, with worsening image quality in the
direction of increasing xy.

4.2. Our Approach to Improving Precision 

In order to mitigate the loss of computational precision, our
approach modifies the original summed-area table computation.
4.2.1. Using Signed-Offset Pixel Representation 

The first modification is to represent pixel values in the orig- 
inal image as signed floating-point values (e.g., values in the 
range -0.5 to 0.5), as opposed to the traditional approach that 
uses unsigned pixel values (from 0.0 to 1.0). 

This modification improves precision in two ways: (i) 
there is a 1-bit gain in precision because the sign bit now 
becomes useful, and (ii) the summed-area function becomes 
non-monotonic, and therefore the maximum value reached 
has a relatively lower magnitude. 

We have investigated two distinct methods for converting 
the original image to a signed-offset representation: (i) cen- 
tering the pixel values around the 50% gray level, and (ii) 
centering them around the mean image pixel value. The for- 
mer involves less computational overhead and gives good 
precision improvement, but the latter provides even better 
results with modest computational overhead. 

Centering around 50% gray level. This method modifies 
the original image by subtracting 0.5 from the value at ev- 
ery pixel, thereby making the pixel values lie in the -0.5 
to 0.5 range. The summed-area table computation proceeds 
as usual, but with the understanding that the table entry at 
pixel position (x,y) will now be 0.5xy less than the actual 
summed-area value. The net impact is a significant gain in 
precision because the table entries now have significantly 
lower magnitudes, and therefore computing the differences 
yields a greater precision result. 
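
The effect of the 50% gray offset can be illustrated with a small NumPy experiment. This is a sketch under simplifying assumptions, not a reproduction of the paper's measurements: the table is accumulated exactly and only then rounded to 16-bit floats, which models limited storage precision rather than rounding during the GPU passes, and the test image is random noise.

```python
import numpy as np

def reconstruct(sat):
    """Recover each pixel with the four-read identity (1x1) filter."""
    p = np.pad(sat.astype(np.float64), ((1, 0), (1, 0)))  # zero border
    return p[1:, 1:] - p[:-1, 1:] - p[1:, :-1] + p[:-1, :-1]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)) / 255.0      # values in [0, 1]

# Traditional table, stored in 16-bit floats.
sat_plain = img.cumsum(0).cumsum(1).astype(np.float16)
err_plain = np.abs(reconstruct(sat_plain) - img).max()

# Signed-offset table: subtract 0.5 before summing, add it back after filtering.
sat_offset = (img - 0.5).cumsum(0).cumsum(1).astype(np.float16)
err_offset = np.abs(reconstruct(sat_offset) + 0.5 - img).max()

print(err_plain, err_offset)
```

Because the offset table's entries stay small in magnitude, their rounding error is much smaller, and so is the error of the four-read difference.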

Fig. 4 demonstrates the usefulness of this approach. The
first row shows three versions of a checkerboard. The image
on the right, generated using the traditional method,
exhibits unacceptable noise throughout much of the image. In
contrast, the middle image, generated by our method, barely
shows error.

Figure 4: The left column shows the original input images, the middle column are reconstructions from summed-area tables
(SATs) generated using our method, and the right column are reconstructions from SATs generated with the old method. For
the first row, the SATs are constructed using 16-bit floats, for the second row the SATs are constructed using 24-bit floats, and
the final row shows a zoomed version of the second row (region of interest highlighted).

Centering around image pixel average. While centering
pixel values around the 50% gray level proved to be quite
useful, an even better approach is to store offsets from the
image's average pixel value. This is especially true of images
such as Lena for which the image average can be quite different
from 50% gray. For such images, centering around 50%
gray could still result in sizable magnitudes at each pixel
position, thereby increasing the probability that the
summed-area values could appreciably grow in magnitude. Centering
the pixel values around the actual image average guarantees
that the summed-area value is equal to 0 both at the origin
and at the upper right corner (modulo floating-point rounding
errors).

The computational overhead of this approach is fairly
modest as the image average is easily computed in hardware
using mip mapping.
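
The difference between the two centering choices can be seen in a short NumPy sketch. It is illustrative only: the bright random test image is an assumption chosen to have a mean well above 50% gray, and `img.mean()` stands in for the average that the text obtains through mip mapping.

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.random((32, 32)) * 0.3 + 0.6      # a bright image, mean near 0.75

sat_gray = (img - 0.5).cumsum(0).cumsum(1)         # centered on 50% gray
sat_mean = (img - img.mean()).cumsum(0).cumsum(1)  # centered on the image mean

print(abs(sat_gray[-1, -1]))   # grows with the image area
print(abs(sat_mean[-1, -1]))   # approximately 0, up to rounding
print(np.abs(sat_gray).max(), np.abs(sat_mean).max())
```

For this image the gray-centered table still climbs steadily toward the upper right corner, while the mean-centered table returns to zero there and stays small in between.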

4.2.2. Using Origin-Centered Image Representation 

The second modification involves anchoring the origin of the
coordinate system to the center of the image, instead of to the
bottom-left image corner. In effect, this simple modification
halves the maximum values of x and y over which
summed areas are accumulated. As a result, for a given
precision level, images of double the width and height can be
handled.
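
The gain can be made explicit with the precision formula from Section 2, writing $P_s$ for the table precision and $P_i$ for the input precision. With the origin at the image center, each table entry accumulates over at most a $(w/2) \times (h/2)$ region, so under that formula:

```latex
P_s = \log_2\!\left(\tfrac{w}{2}\right) + \log_2\!\left(\tfrac{h}{2}\right) + P_i
    = \log_2(w) + \log_2(h) + P_i - 2
```

The two bits saved are exactly what a doubling of both width and height would otherwise cost, which matches the statement above.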



5. Example Applications 

Since our technique is fast enough to generate summed-area
tables every frame, it becomes feasible to use them for
real-time, interactive effects. We present four example
applications. The first is a method to generate glossy
environmental reflections. The second application uses a summed-area
table to render glossy planar reflections, where the blurriness
of an object's reflection varies depending on its distance
from the reflecting plane. The third application presents
a technique to render glossy transparency, and finally, the
fourth application, previously presented by Greene [Gre03],
renders images with a depth-of-field effect. We believe that
these applications are a compelling demonstration of the
power of real-time summed-area table generation.



5.1. Glossy Environmental Reflections 

In [KVHS00], Kautz et al. presented a method for real-time
rendering of glossy reflections for static scenes. They
rendered a dual-paraboloid environment map and pre-filtered
it in an offline process. Instead of pre-filtering, we create a
summed-area table for each face of a dual-paraboloid map
on the fly, and use them to filter the environment map at
run time. This enables real-time environmental
glossy reflections for dynamic scenes.

Figure 6: An object textured using fan
of summed-area tables generated frot

Figure 5 is an image of an object where the environment
map has been filtered with a spatially varying filter function;
in this case the filter support has been modulated by another
texture. The image is rendered in real time, at a rate of over
60 frames per second. The filter function, scene geometry
and environment map can change every frame.

There are several compelling reasons for using dual- 
paraboloid environment mapping over the more commonly 
used cube mapping. First, Kautz et al. showed that when fil- 
tering in image space, as opposed to filtering over the solid 
angles, a dual-paraboloid environment map has lower er- 
ror than a cube map or a spherical map. Second, it is only 
necessary to generate two summed-area tables as opposed 
to six summed-area tables. Finally, for large filters, a dual- 
paraboloid map will require data from only two textures, 
whereas it is possible that data might be required from all 
six faces of a cube map. 

5.1.1. Box Filtering 

A coarse approximation to a glossy BRDF is a simple box 
filter. A single box-filter evaluation takes four texture reads 
from the summed-area table. Two evaluations are required 
on current hardware when a filter is supported by both the 
front and the back of a dual-paraboloid map. On future hard- 
ware it may be possible to conditionally evaluate the filters 
for both maps only when necessary. 

As is common when storing a spherical map in a square
texture, our implementation uses the alpha channel to mark
the pixels that are in the dual-paraboloid map. A pixel is
considered to be in the map if its alpha value is one. We also
use the alpha value to count the area covered by the filter.
After combining the result of the evaluation from the front
and back maps, the alpha channel holds the total count of
summed texels, which is then used to normalize the filter

Figure 7: A set of four box filters stacked to approximate a
Phong BRDF.

The basic algorithm for rendering glossy environmental 
reflections is 

    renderCubeMap();
    generateParaboloidMapFromCubeMap();
    generateSummedAreaTable(frontMap);
    generateSummedAreaTable(backMap);
    setupTextureCoordinateGeneration();

    renderScene
    {
        foreach fragment on reflective object:
        {
            front = evaluateSAT(FrontSAT, filter_size);
            back = evaluateSAT(BackSAT, filter_size);

            // compute filter area
            area = front.alpha + back.alpha;

            // combine front and back color
            result = front + back;

            // divide by the area of the filter
            result /= area;

            computeFinalColor(result);
        }
    }
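
The per-fragment steps of the listing above can be sketched on the CPU as follows. This is a simplified illustration, not the shader itself: the dual-paraboloid projection is omitted, the lookup positions `front_xy` and `back_xy` are assumed to be given in texel coordinates inside each map, texels outside the paraboloid are assumed to have zero color and zero alpha, and all function names are hypothetical.

```python
import numpy as np

def build_sat(rgba):
    """rgba: (H, W, 4) array; alpha is 1 inside the paraboloid map, 0 outside."""
    return rgba.cumsum(axis=0).cumsum(axis=1)

def evaluate_sat(sat, cx, cy, half):
    """Unnormalized RGBA sum over a box centered on (cx, cy): four table reads."""
    h, w, _ = sat.shape
    x0, x1 = max(cx - half, 0) - 1, min(cx + half, w - 1)
    y0, y1 = max(cy - half, 0) - 1, min(cy + half, h - 1)

    def read(x, y):
        return sat[y, x] if x >= 0 and y >= 0 else np.zeros(4)

    return read(x1, y1) - read(x0, y1) - read(x1, y0) + read(x0, y0)

def glossy_sample(front_sat, back_sat, front_xy, back_xy, half):
    front = evaluate_sat(front_sat, *front_xy, half)
    back = evaluate_sat(back_sat, *back_xy, half)
    total = front + back            # combine front and back color
    area = total[3]                 # summed alpha counts the texels in the maps
    return total[:3] / area if area > 0 else np.zeros(3)

front_map = np.zeros((8, 8, 4))
front_map[...] = [0.2, 0.4, 0.6, 1.0]
back_map = np.zeros((8, 8, 4))      # filter misses the back map entirely
color = glossy_sample(build_sat(front_map), build_sat(back_map), (4, 4), (4, 4), 2)
```

Normalizing by the summed alpha rather than by the nominal box area keeps the result correct when part of the box falls outside the paraboloid or is split between the two maps.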

While our current implementation creates a dual- 
paraboloid map from a cube map, it is possible to directly 
generate the dual-paraboloid map by using a vertex program 
to project the scene geometry as in [CHL04]. 

5.1.2. Stacked Box Filtering

More complex filter functions can be constructed at the cost
of more texture reads by stacking multiple box filters on top
of each other. The stacked boxes approximate the shape of
smoother filters. For a single summed-area table, each filter
in the stack requires eight texture reads (four for each of
the front and back maps). So a complex filter created from
a stack of four box filters would perform thirty-two texture
reads per fragment.

Figure 8: Example of translucency using a summed-area table
to filter the view seen through the glass.

Both OpenGL and Direct3D provide a means to automati- 
cally generate texture coordinates based on the normal direc- 
tion and reflection direction. By combining box filters gener- 
ated from both the reflection direction and the normal direc- 
tion, it is possible to compute an approximation of the Phong 
BRDF. Figure 7 shows an image generated using a stack of 
two large box filters centered on the normal direction to ap- 
proximate the diffuse component of the Phong BRDF and 
a stack of two smaller box filters centered on the reflection 
direction to approximate the specular component. 

5.2. Glossy Planar Reflections 

Since the summed-area table enables filtering with arbi- 
trary support, it is relatively easy to render glossy reflections 
where the blurriness of an object varies depending on the 
distance of the reflected object from the reflector. This ef- 
fect is often seen when an object is placed on a glossy table 
top. The object's reflection is much sharper where the object 
and table top meet than elsewhere. Figure 1 shows an image 
where the floor is a glossy reflector, and the blurriness of the 
reflection depends on the object's distance from the floor. 

The effect is accomplished by augmenting the standard 
planar reflection algorithm. The pass for rendering the re- 
flected scene from the virtual viewpoint outputs both the 
color and the distance to the reflection plane to a texture. 
A summed-area table is generated from the color data. Then 
the planar reflector is rendered from the summed-area ta- 
ble, using the previously saved distance to modulate the filter 

5.3. Translucency 

Approaches to rendering translucent materials include those
of [Arv95, Die96]. We are able to render real-time interactive
translucent objects using a summed-area table; this technique
can be used to render such effects as etched and milky
glass. Figure 8 shows a scene with multiple translucent objects.

Figure 9: Simulated depth of field.
The effect is achieved by first rendering the scene, to tex- 
ture memory, without the translucent objects. A summed- 
area table is generated from the resulting image. Then we 
render the translucent objects with a fragment program that 
uses the summed-area table to blur the regions of the scene 
behind the objects. 

5.4. Depth of Field 

In [Gre03], Greene presents a technique to render an image
with a depth-of-field effect using summed-area tables.
His summed-area table generation technique is problematic
since it requires that a texture be read from and written to
at the same time. Unfortunately, graphics hardware, due
to its parallel streaming architecture, makes no guarantees
about the execution sequence of read-modify-write operations.


In [Dem04], a technique to render a depth-of-field effect
was presented that used mip maps to approximate a simple
filter. Because of the artifacts introduced by the mip-map
filtering technique, the authors add noise to reduce the
perceptible Mach bands.

Unlike mip maps, summed-area tables are able to aver- 
age arbitrary rectangular sections of an image, allowing us 
to implement a real-time, interactive version of the depth- 
of-field effect, without having to add noise to mask filtering 
artifacts. However, our implementation does have the same 
drawbacks as other image filtering techniques for generating 
a depth-of-field effect, such as the bleeding of sharp in-focus 
objects onto blurry backgrounds. Figure 9 shows an image 
rendered with depth of field. A 1024x768 image renders at 
a rate of 23 frames per second. A lower resolution version 
renders at much higher frame rates. 

                            Summed-area table size
    GPU                     256x256       512x512       1024x1024
    Radeon 9800 XT (1)      3.1 ms (8)    14.2 ms (4)   70.1 ms (4)
    Radeon X800XT PE (1)    1.4 ms (8)    7.3 ms (4)    36.2 ms (4)
    GeForce 6800 Ultra (2)  4.3 ms (8)    32.4 ms (4)   95.3 ms (4)

    (1) 24-bit floats   (2) 32-bit floats

Table 1: Shortest time to generate summed-area tables of
different sizes. The number of samples per pass are given in
parentheses.

                            Summed-area table size
    Samples/pass            256x256       512x512       1024x1024
    2                       2.3 ms        9.9 ms        44.3 ms
    4                       1.8 ms        7.3 ms        36.2 ms
    8                       1.4 ms        9.9 ms        45.6 ms
    16                      2.7 ms        12.4 ms       53.3 ms

Table 2: Time to generate summed-area tables of different
sizes using a different number of samples per pass on a
Radeon X800XT Platinum Edition graphics card.

The effect is accomplished by first rendering the scene
from the camera's point-of-view and saving the color and
depth buffers to texture memory. Next a summed-area table
is generated from the saved color buffer. As in [Dem04], the
depth buffer is used to determine the circle of confusion.
Finally, a screen-filling quad is rendered, and a fragment
program is used to blur the color buffer based on the circle of
confusion.
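
The final blur step can be sketched on the CPU with NumPy as a per-pixel box filter whose half-width comes from the depth buffer. This is a hedged illustration only: the image is grayscale, and `circle_of_confusion` is a hypothetical linear stand-in, not the camera model used in [Dem04] or in the paper's implementation.

```python
import numpy as np

def box_blur_varying(color, radius):
    """Blur a grayscale image with a per-pixel box half-width (in pixels)."""
    h, w = color.shape
    # Zero row and column in front, so sat[y, x] = sum of color[:y, :x].
    sat = np.pad(color, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    ys, xs = np.mgrid[0:h, 0:w]
    x0 = np.clip(xs - radius, 0, w)
    x1 = np.clip(xs + radius + 1, 0, w)
    y0 = np.clip(ys - radius, 0, h)
    y1 = np.clip(ys + radius + 1, 0, h)
    total = sat[y1, x1] - sat[y0, x1] - sat[y1, x0] + sat[y0, x0]
    return total / ((x1 - x0) * (y1 - y0))     # divide by the clipped box area

def circle_of_confusion(depth, focus_depth, scale):
    """Hypothetical stand-in: blur radius grows linearly with distance from focus."""
    return np.rint(scale * np.abs(depth - focus_depth)).astype(int)

rng = np.random.default_rng(0)
color = rng.random((48, 64))
depth = np.tile(np.linspace(1.0, 10.0, 64), (48, 1))   # depth increases to the right
radius = circle_of_confusion(depth, focus_depth=1.0, scale=0.5)
blurred = box_blur_varying(color, radius)
```

Every output pixel costs four table reads regardless of its blur radius, so in-focus and heavily blurred regions are equally cheap to evaluate.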

6. Summed-Area Table Generation Performance 

Table 1 shows the time required to generate summed-area 
tables of different sizes on a number of graphics cards using 
DirectX 9. We list the shortest time we could achieve for 
each card and input size along with the number of samples 
per pass used to get the best performance. Table 2 shows 
performance based on input size and the number of samples 
per pass for one of the cards used in our test. 

Our benchmark results show that finding a good balance 
between the number of rendering passes and the amount of 
work performed during each pass is important for the overall 
performance of summed-area table generation. The optimal 
tradeoff between the number of passes and per-pass cost is 
largely dependent on the overhead of render target switches 
and the design of the texture cache on the target platform. 

Computing summed-area tables directly on the graphics
card is better than performing this computation on the CPU
for several reasons. First, the input data is already present
in GPU memory. Transferring the data to the CPU for
processing and then back again would put an unnecessary
burden on the bus and can easily become a bottleneck,
because many graphics drivers are unable to reach full
theoretical bandwidth utilization when reading back data from the
GPU [GPU]. Moreover, moving data back and forth between
GPU and CPU would break GPU-CPU parallelism because
each processor would end up waiting for new results from
the other processor.

In our opinion, the particularly good performance of
generating a 256x256 summed-area table on modern graphics
hardware makes dynamic glossy reflections using
dual-paraboloid maps (as outlined in Section 5) very feasible.

7. Future Work 

In the future we plan to quantify how closely a set of stacked 
box filters can approximate an arbitrary BRDF, and develop 
a set of criteria to generate the box-filter stack that best rep- 
resents a given BRDF. While the techniques presented in 
this paper substantially reduce the precision requirements of 
summed-area tables, work is needed on techniques to reduce 
them even further. Doing so will make it feasible to generate
second- and third-order summed-area tables, which would
allow more complex filter functions, such as a Bartlett filter or
a parabolic filter.

8. Conclusion 

We have introduced a technique to rapidly generate
summed-area tables, which enable constant-time, spatially
varying box filtering. This capability can be used to simulate a
variety of effects. We demonstrate glossy environmental
reflections, glossy planar reflections, translucency, and depth
of field.

The biggest drawback to summed-area tables is the high 
demand they make on numerical precision. To ameliorate 
this problem, we have developed some techniques to more 
effectively use the limited precision available on current 
graphics hardware. 